epiAneufinder is an algorithm used for calling Copy Number Variations (CNVs) from single-cell ATAC (scATAC) data. Single-cell open chromatin profiling via the single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) assay has become a mainstream measurement of open chromatin in single-cells. epiAneufinder exploits the read count information from scATAC-seq data to extract genome-wide copy number variations (CNVs) for each individual cell. It allows the addition of single-cell CNV information to scATAC-seq data, without the need of additional experiments, unlocking a layer of genomic variation which is otherwise unexplored.
The main function of epiAneufinder performs the following three steps:
It can be run as follows
epiAneufinder(input="../sample_data/sample.tsv",
outdir="epiAneufinder_results",
blacklist="../sample_data/hg38-blacklist.v2.bed",
windowSize=1e5,
genome="BSgenome.Hsapiens.UCSC.hg38",
exclude=c('chrX','chrY','chrM'),
reuse.existing=FALSE,
title_karyo="Karyogram of sample data",
ncores=4,
minFrags=20000,
gc_correction="quadratic",
minsizeCNV=0,
k=4,
plotKaryo=TRUE)
#> Warning in dir.create(outdir, recursive = TRUE):
#> 'epiAneufinder_results/epiAneufinder_results' already exists
#> Removing old file from the output folder
#> Subtracting Blacklist...
#> Adding Nucleotide Information...
#> 1 of 22
#> 2 of 22
#> 3 of 22
#> 4 of 22
#> 5 of 22
#> 6 of 22
#> 7 of 22
#> 8 of 22
#> 9 of 22
#> 10 of 22
#> 11 of 22
#> 12 of 22
#> 13 of 22
#> 14 of 22
#> 15 of 22
#> 16 of 22
#> 17 of 22
#> 18 of 22
#> 19 of 22
#> 20 of 22
#> 21 of 22
#> 22 of 22
#> [1] "Finished making windows successfully"
#> Obtaining the fragments tsv file
#> Getting Counts...
#> Counting reads from fragments/bed file ..
#> Count matrix with 16 cells and 26088 windows has been generated and will be saved as count_summary.rds
#> Filtering empty windows, 24558 windows remain.
#> Correcting for GC bias with a binned quadratic polynomial ...
#> Calculating distance AD
#> Successfully identified breakpoints
#> Successfully discarded irrelevant breakpoints
#> Successfully assigned gain-loss
#> A .tsv file with the results has been written to disk.
#> It contains the copy number states for each cell per bin.
#> 0 denotes 'Loss', 1 denotes 'Normal', 2 denotes 'Gain'.
#> Successfully plotted karyogramIn the example above, a fragment file is supplied, but different input formats are possible for EpiAneufinder. Alternatives are a directory with bam files per cell or an already generated count matrix. The file type is identified automatically. The fragment file needs to be a single file with the ending tsv or bed. For the count matrix, a directory with three files is required, called matrix.mtx, barcodes.tsv and peaks.bed. The bam files need all to be in one directory with a bam index file each.
The filtering options are dependent on the input file type. In case,
a fragment file is provided the parameter minFrags filters
cells with too less fragments. In case, a directory with bam files is
provided the parameter mapqFilter filters low quality
reads. For the option to provide directly a count matrix, no filtering
is currently implemented.
Note of caution We compared the results of CNV calling with the peak matrix compared to fragment file as input. We used the SNU601 dataset as in the publication and compared with the scWGS dataset as groundtruth. The correlation between the CNVs is worse when using the matrix as input (0.74 from 0.85). We loose reads that are not in peak regions so overall we get less coverage over the genome. The result is that we can miss some gains and losses that would otherwise have been identified with the fragment input. So we would suggest a bit of caution in the interpretation of the results, but the comparison with the groundtruth the method still gives valid results.
The counts per windows are biased by the GC content of the associated
genomic region, so that a GC correction step is essential. As this
correction can be come time-intensive, we offer the user several options
to balance runtime and accuracy in the parameter
gc_correction.
The original method proposed in our publication, called
loess is very accurate, but doesn’t scale well with large
datasets. A faster, however more approximate alternative is the method
called quadratic. Briefly, both versions fit a correction
factor to adjust for GC-dependent count deviations, but using two
different models. The option loess applies a LOESS model,
while quadratic bins the data based on the GC content and
fits a bin-wise quadratic polynomial model. Both versions are applied
separately on each cell. Detailed descriptions and formulas can be found
for loess in the epiAneufinder publication (Ramakrishnan et
al., NatComms 2023) and for quadratic in the aneuFinder
publication (Bakker et al, Genome Biology 2016).
A third alternative is to apply the LOESS function not separately on
each cell, but once fit a model on the pseudobulk aggregates across all
cells (using the median counts) and use this as a general correction
factor for each cell. This option is called bulk_loess.
In evaluations on different datasets, we validated that the overall
method performance differs only slightly depending on the chosen GC
method. loess and bulk_loess were slightly
more accurate overall. However, the option quadratic is the
fastest, followed by bulk_loess. We would recommend for
large datasets to apply quadratic (default option) or
bulk_loess and for small datasets loess.
The example is run on the genome hg38, but any other genome available
at BSgenome can be used correspondingly, after installing the matching
bioconductor package. Corresponding to the genome version, the correct
blacklist file need to be provided to the function and the
corrected naming of the excluded chromosomes (variable
exclude).
The window size parameter defines the region size for the CNV predictions, counts within each bin will be aggregated and one CNV annotation for each bin is obtained.
The minsizeCNV parameter regulates the minimal number of
consecutive bins required for a CNV annotation, i.e. each CNV will
contain at least minsizeCNV bins. Setting the variable to 0
disables the restriction that a CNV needs to have a certain size and
also CNVs with one bin can be called.
The k parameters regulates the number of iterations of
the segmentation algorithm. A maximum of 2^k CNVs can be
found per chromsomes in k iterations. A large
k parameter increases however the runtime substantially, so
we set the default value to 4.
The epiAneufinder package contains different downstream functions for a more in depth analysis of the CNV predictions, in particular to split the cells into different subclones and different additional plotting functions.
Subclones in the dataset can be identified based on hierarchical
clustering of the identified CNVs. The dendogram is cut at a chosen
depth tree_depth.
#Load result table
res_table<-read.table("epiAneufinder_results/epiAneufinder_results/results_table.tsv")
#Need to reformat column names as dash are replaced by points in R
colnames(res_table)<-gsub("\\.","-",colnames(res_table))
subclones<-split_subclones(res_table,tree_depth=4,
plot_tree=TRUE,
plot_path="epiAneufinder_results/epiAneufinder_results/subclones.pdf",
plot_width=4,
plot_height=3)
#Data frame with the subclone split
head(subclones)
#> cell subclone
#> cell-TCAGCTCTCGCGTTCT-1 cell-TCAGCTCTCGCGTTCT-1 1
#> cell-TACGCAAAGTCGTGAG-1 cell-TACGCAAAGTCGTGAG-1 2
#> cell-TAACTTCTCCCGGGTA-1 cell-TAACTTCTCCCGGGTA-1 3
#> cell-ACCATCCGTTCTGAAC-1 cell-ACCATCCGTTCTGAAC-1 3
#> cell-CTCAGCTAGAGACTCG-1 cell-CTCAGCTAGAGACTCG-1 3
#> cell-ACAATCGTCTCATCCG-1 cell-ACAATCGTCTCATCCG-1 4Additional annotations can be added to the karyograms generated by
epiAneufinder, for example cell type annotations or the subclonal
distribution identified by split_subclones.
We recommend to save the karyogram as a png, defined via the variable
plot_path, as pdf versions can get quite large.
#Load result table
res_table<-read.table("epiAneufinder_results/epiAneufinder_results/results_table.tsv")
#Need to reformat column names as dash are replaced by points in R
colnames(res_table)<-gsub("\\.","-",colnames(res_table))
annot_dt<-data.frame(cell=subclones$cell,
annot=paste0("Clone",subclones$subclone))
plot_karyo_annotated(res_table=res_table,
plot_path="epiAneufinder_results/epiAneufinder_results/karyo_annotated.png",
annot_dt=annot_dt)The predictions of epiAneufinder on individual cells can be further
explored by evaluating the count profile of each cell. The set of cells
for detailed investigation can be either subsampled randomly by
assigning an integer value to the parameter selected_cells,
then this number of cells are sample. Or a vector of barcodes can be
given directly to selected_cells.
outdir<-"epiAneufinder_results/epiAneufinder_results/"
#Version 1 - Selected a specific set of barcodes, e.g. based on subclones
cell_barcodes<-subclones$cell[subclones$subclone==1]
cell_barcodes
#> [1] "cell-TCAGCTCTCGCGTTCT-1"
plot_single_cell_profile(outdir,
threshold_blacklist_bins=0.85,
selected_cells=cell_barcodes,
plot_path=paste0(outdir,"individual_tracks_clone1.pdf"),
plot_width=25,
plot_height=15)
#> quartz_off_screen
#> 2
#Version 2 - Randomly subsampling a certain number of cells
plot_single_cell_profile(outdir,
threshold_blacklist_bins=0.85,
selected_cells=3,
plot_path=paste0(outdir,"individual_tracks_random_subsample.pdf"),
plot_width=25,
plot_height=15)
#> quartz_off_screen
#> 2